News Thread Extraction Based on Topical N-Gram Model with a Background Distribution
Abstract
Automatic thread extraction for news events can help people understand different aspects of a news event. In this paper, we present a thread extraction method that uses a topical N-gram model with a background distribution (TNB). Unlike most topic models, such as Latent Dirichlet Allocation (LDA), which rely on the bag-of-words assumption, our model treats words in their textual order. Each news report is represented as a combination of a background distribution over the corpus and a mixture distribution over hidden news threads. Thus our model can capture “presidential election”, which recurs across different years, as a background phrase and “Obama wins” as a thread for the event “2008 USA presidential election”. We apply our method to two different corpora. Evaluation based on human judgment shows that the model can generate meaningful and interpretable threads from a news corpus.
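The representation described above, a background distribution combined with a mixture over hidden threads, can be illustrated with a minimal sketch. This is not the TNB model itself: it is a unigram simplification that ignores word order, and the function name, the mixing weight `background_weight`, and all toy probabilities are assumptions made up for illustration only.

```python
from typing import Dict, List


def word_probability(
    word: str,
    background: Dict[str, float],
    threads: List[Dict[str, float]],
    thread_weights: List[float],
    background_weight: float,
) -> float:
    """Probability of a word under a background-plus-threads mixture.

    P(w) = lam * P_bg(w) + (1 - lam) * sum_k theta_k * P_k(w)

    Hypothetical unigram simplification; the model in the abstract
    additionally treats words in their textual order.
    """
    # Contribution of the report-specific mixture over threads.
    thread_part = sum(
        weight * dist.get(word, 0.0)
        for weight, dist in zip(thread_weights, threads)
    )
    # Blend the corpus-wide background with the thread mixture.
    return (
        background_weight * background.get(word, 0.0)
        + (1.0 - background_weight) * thread_part
    )


# Toy numbers, invented for illustration (not taken from the paper).
background = {"presidential": 0.5, "election": 0.5}
threads = [
    {"obama": 0.6, "wins": 0.4},
    {"mccain": 0.7, "concedes": 0.3},
]
thread_weights = [0.75, 0.25]  # this report is mostly about thread 0
background_weight = 0.4

vocabulary = set(background) | {w for t in threads for w in t}
probabilities = {
    w: word_probability(w, background, threads, thread_weights, background_weight)
    for w in vocabulary
}
```

In this toy setup, words such as "election" receive probability only from the background component, while "obama" receives probability only from the thread mixture, which mirrors the background phrase versus thread distinction drawn in the abstract.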
Similar resources
Information Extraction from Broadcast News
This paper discusses the development of trainable statistical models for extracting content from television and radio news broadcasts. In particular we concentrate on statistical finite state models for identifying proper names and other named entities in broadcast speech. Two models are presented: the first represents name class information as a word attribute; the second represents both word-...
Dynamic Language Model Adaptation Using Latent Topical Information and Automatic Transcripts
This paper investigates dynamic language model adaptation for Mandarin broadcast news recognition. A topical mixture model was presented to dynamically explore the long-span latent topical information for language model adaptation. The underlying characteristics and different kinds of model structures were extensively investigated, while their performance was verified by comparison with the con...
Lexicon Analysis Based Automatic News Classification Approach – A Review
News classification is the primary approach for online news portals whose news data is sourced from various portals. Various types of data are received and accepted over the news classification portals. Lexicon analysis plays the key role in automatically categorizing the news, using automatic news category recognition by analyzing the keyword data extrac...
Fitting long-range information using interpolated distanced n-grams and cache models into a latent dirichlet language model for speech recognition
We propose a language modeling (LM) approach using interpolated distanced n-grams into a latent Dirichlet language model (LDLM) [1] for speech recognition. The LDLM relaxes the bag-of-words assumption and document topic extraction of latent Dirichlet allocation (LDA). It uses default background n-grams where topic information is extracted from the (n-1) history words through Dirichlet distributi...
Akamon: An Open Source Toolkit for Tree/Forest-Based Statistical Machine Translation
We describe Akamon, an open source toolkit for tree and forest-based statistical machine translation (Liu et al., 2006; Mi et al., 2008; Mi and Huang, 2008). Akamon implements all of the algorithms required for tree/forest-to-string decoding using tree-to-string translation rules: multiple-thread forest-based decoding, n-gram language model integration, beam- and cube-pruning, k-best hypotheses ex...
Publication date: 2011